Prosper Loan Data - An Exploration

by Meenakshi Abbaraju

Prosper Marketplace, Inc. is a San Francisco, California-based company in the peer-to-peer lending industry. Prosper Funding LLC, one of its subsidiaries, operates Prosper.com, a website where individuals can either invest in personal loans or request to borrow money. The Prosper Loan Dataset was provided by Udacity. Here, I attempt to explore the data and present visualizations after analysis.

I began by loading all the libraries required for the analysis followed by reading the csv file.

Structure of Loan Data

It was found that data frame consists of 113937 observations of 81 variables. This is an enormous dataset. I decided to choose certain columns and work exclusively on them. My new dataframe consists of 21 variables. The variables chosen were: ‘Term’, ‘LoanStatus’, ‘BorrowerAPR’, ‘BorrowerRate’, ‘ListingCategory..numeric.’, ‘BorrowerState’, ‘Occupation’, ‘EmploymentStatus’, ‘EmploymentStatusDuration’, ‘IsBorrowerHomeowner’, ‘CreditScoreRangeLower’,‘CreditScoreRangeUpper’, ‘CurrentCreditLines’, ‘AvailableBankcardCredit’, ‘DebtToIncomeRatio’, ‘IncomeRange’, ‘IncomeVerifiable’, ‘StatedMonthlyIncome’, ‘LoanOriginalAmount’, ‘LoanOriginationDate’, ‘MonthlyLoanPayment’.

Univariate Plots Section

## 'data.frame':    113937 obs. of  21 variables:
##  $ Term                     : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus               : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ BorrowerAPR              : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate             : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ ListingCategory..numeric.: int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState            : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation               : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus         : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner      : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CreditScoreRangeLower    : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper    : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ CurrentCreditLines       : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ AvailableBankcardCredit  : num  1500 10266 NA 30754 695 ...
##  $ DebtToIncomeRatio        : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange              : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable         : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome      : num  3083 6125 2083 2875 9583 ...
##  $ LoanOriginalAmount       : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate      : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ MonthlyLoanPayment       : num  330 319 123 321 564 ...

I will set “Term” variable as factors so as to be able to plot it as discrete values. I created a Status variable which stores the factored Loan Status. The other two variables created were:

PayByFees : Log of Ratio of StatedMonthlyIncome to MonthlyLoanPayment FeesByPay: Ratio of Monthly Loan Payment to StatedMonthlyIncome

##  Term                       LoanStatus     BorrowerAPR     
##  12: 1614   Current              :56576   Min.   :0.00653  
##  36:87778   Completed            :38074   1st Qu.:0.15629  
##  60:24545   Chargedoff           :11992   Median :0.20976  
##             Defaulted            : 5018   Mean   :0.21883  
##             Past Due (1-15 days) :  806   3rd Qu.:0.28381  
##             Past Due (31-60 days):  363   Max.   :0.51229  
##             (Other)              : 1108   NA's   :25       
##   BorrowerRate    ListingCategory..numeric. BorrowerState  
##  Min.   :0.0000   Min.   : 0.000            CA     :14717  
##  1st Qu.:0.1340   1st Qu.: 1.000            TX     : 6842  
##  Median :0.1840   Median : 1.000            NY     : 6729  
##  Mean   :0.1928   Mean   : 2.774            FL     : 6720  
##  3rd Qu.:0.2500   3rd Qu.: 3.000            IL     : 5921  
##  Max.   :0.4975   Max.   :20.000                   : 5515  
##                                             (Other):67493  
##                     Occupation         EmploymentStatus
##  Other                   :28617   Employed     :67322  
##  Professional            :13628   Full-time    :26355  
##  Computer Programmer     : 4478   Self-employed: 6134  
##  Executive               : 4311   Not available: 5347  
##  Teacher                 : 3759   Other        : 3806  
##  Administrative Assistant: 3688                : 2255  
##  (Other)                 :55456   (Other)      : 2718  
##  EmploymentStatusDuration IsBorrowerHomeowner CreditScoreRangeLower
##  Min.   :  0.00           False:56459         Min.   :  0.0        
##  1st Qu.: 26.00           True :57478         1st Qu.:660.0        
##  Median : 67.00                               Median :680.0        
##  Mean   : 96.07                               Mean   :685.6        
##  3rd Qu.:137.00                               3rd Qu.:720.0        
##  Max.   :755.00                               Max.   :880.0        
##  NA's   :7625                                 NA's   :591          
##  CreditScoreRangeUpper CurrentCreditLines AvailableBankcardCredit
##  Min.   : 19.0         Min.   : 0.00      Min.   :     0         
##  1st Qu.:679.0         1st Qu.: 7.00      1st Qu.:   880         
##  Median :699.0         Median :10.00      Median :  4100         
##  Mean   :704.6         Mean   :10.32      Mean   : 11210         
##  3rd Qu.:739.0         3rd Qu.:13.00      3rd Qu.: 13180         
##  Max.   :899.0         Max.   :59.00      Max.   :646285         
##  NA's   :591           NA's   :7604       NA's   :7544           
##  DebtToIncomeRatio         IncomeRange    IncomeVerifiable
##  Min.   : 0.000    $25,000-49,999:32192   False:  8669    
##  1st Qu.: 0.140    $50,000-74,999:31050   True :105268    
##  Median : 0.220    $100,000+     :17337                   
##  Mean   : 0.276    $75,000-99,999:16916                   
##  3rd Qu.: 0.320    Not displayed : 7741                   
##  Max.   :10.010    $1-24,999     : 7274                   
##  NA's   :8554      (Other)       : 1427                   
##  StatedMonthlyIncome LoanOriginalAmount          LoanOriginationDate
##  Min.   :      0     Min.   : 1000      2014-01-22 00:00:00:   491  
##  1st Qu.:   3200     1st Qu.: 4000      2013-11-13 00:00:00:   490  
##  Median :   4667     Median : 6500      2014-02-19 00:00:00:   439  
##  Mean   :   5608     Mean   : 8337      2013-10-16 00:00:00:   434  
##  3rd Qu.:   6825     3rd Qu.:12000      2014-01-28 00:00:00:   339  
##  Max.   :1750003     Max.   :35000      2013-09-24 00:00:00:   316  
##                                         (Other)            :111428  
##  MonthlyLoanPayment             Status        PayByFees    
##  Min.   :   0.0     Defaulted      :17010   Min.   : -Inf  
##  1st Qu.: 131.6     Past Due       : 2067   1st Qu.:2.543  
##  Median : 217.7     Current or Paid:94855   Median :3.003  
##  Mean   : 272.5     Cancelled      :    5   Mean   :  NaN  
##  3rd Qu.: 371.6                             3rd Qu.:3.555  
##  Max.   :2251.5                             Max.   :  Inf  
##                                             NA's   :15     
##    FeesByPay            Due        
##  Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.02859   1st Qu.:0.0000  
##  Median :0.04962   Median :0.0000  
##  Mean   :    Inf   Mean   :0.1619  
##  3rd Qu.:0.07865   3rd Qu.:0.0000  
##  Max.   :    Inf   Max.   :1.0000  
##  NA's   :15

Exploration of the variables

Term

It is seen from the plot that there are three terms viz. 1, 3 & 5 years. The three-year term is the most common.

Borrower Rate & APR:

The plots are more or less similar, skewed on the right. The peak is at 0.36.
0.3177 is the most common rate which was found by grouping and summarising on the Borrower Rate.

## # A tibble: 5 × 2
##   BorrowerRate     n
##          <dbl> <int>
## 1       0.3177  3672
## 2       0.3500  1905
## 3       0.3199  1651
## 4       0.2900  1508
## 5       0.2699  1319

Debt To Income Ratio:

Most of the values lie between 0 and 1 in the histogram for DebtToIncome Ratio.

Being curious, I facet-wrapped it by Income range and noticed that the $1-25K income range has a few ratios which exceed 1. This shows that people of that group have the maximum DebtToIncome ratio. This is not abnormal. The plot corroborated what I had guessed out of common knowledge.

LoanOriginationDate:

I have created two new variables(using lubridate) to store the Year and Month of the loan creation date.There is a dip in 2008-2009. Slowly, the loans again started to pick up in 2011 and 2012. The Loans as per the histogram peaked in the latter part of 2013 and earlier 2014.

## # A tibble: 3 × 5
##     Term   LoanSum  LoanMean LoanMedian Count
##   <fctr>     <dbl>     <dbl>      <dbl> <int>
## 1     36 638686342  7276.155       5000 87778
## 2     60 303631410 12370.398      11500 24545
## 3     12   7576595  4694.297       3500  1614

Employment Status Duration:

There is a positive skew to the plot. The square root is applied to get a clearer picture of the skew. More people have shorter employment duration. This is definitely not good, since the chance of repayment of debt decreases.

Credit Score Ranges (lower and upper):

Both the credit score ranges produce, more or less a similar plot. I took a closer look by setting xlim for both lower and upper credit scores. All the distributions are normal.

Current Credit Lines & Available Bankcard Credit:

The current credit lines faceted by Income Range doesn’t show anything concrete except that it is highest in $25 K to $75 K. For the logical flow of discussion, I added these facet-wrapped plots here, though technically they are not univariate.

Stated Monthly Income:

The following are six different plots providing different ranges of Stated Monthly Income.

Monthly Loan Payment:

The monthly loan payment peaks at around $200-300. This is quite normal. We do not expect them to run in thousands as no one capable of paying that amount would apply for a loan(under normal circumstances).

Original Loan Amount:

The plot is right skewed, peaking at $4000, $10000, $15000, $20000 $25000. The mean Loan amount was found to be $8337.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

Listing Category:

The Listing Category represents the cause for taking the loan for eg, Wedding or vacation or Student use etc.

Loan Status:

I created a new variable to hold a list of vector values which give the Loan Status in three ways - Open, Closed and Cancelled.

## Cancelled    Closed      Open 
##         5     55084     58848

Income Range:

Maximum number of people fall in the $25K-$50K and $50K-$75K ranges. Sadly, many people do not have their incomes listed and they fall in the “Not Displayed” range. It is interesting to note that such a vast number of people in $75K-100K and $100K+ also applied for loans. We would normally not expect to see such results. We would generally, expect such high counts in the $1-25K range.

Employment Status:

Most of the borrowers are employed.

Summarizing the Loan Amount and arranging on descending order of Mean Loan Amount, it was found that DC topped the list.

## # A tibble: 6 × 3
##   BorrowerState Mean_loan Mean_monthly
##          <fctr>     <dbl>        <dbl>
## 1            DC 10125.751     337.9405
## 2            NJ  9529.019     303.0464
## 3            MA  9514.832     303.7583
## 4            NH  9463.040     310.1752
## 5            AK  9311.805     305.0123
## 6            MD  9249.990     300.0380

Summarizing the Mean Monthly Loan payment also gave me the same order of states as when arranged by Loan Amount. This further establishes the fact that monthly loan payment has a direct, strong positive relation with Loan Payment and is a candidate for a linear model.

The mean Loan Amount and the mean MonthlyPayment is highest in DC. The number of loans is highest in CA; thus, large amount of money is involved there. So also, a large amount of money is involved in DC. This is what instigated me to compare the two states.

## # A tibble: 6 × 2
##   BorrowerState Mean_monthly
##          <fctr>        <dbl>
## 1            DC     337.9405
## 2            NH     310.1752
## 3            AK     305.0123
## 4            MA     303.7583
## 5            NJ     303.0464
## 6            MD     300.0380

While the maximum mean Loan Amount is from DC the maximum number of loans come from CA, FL, NY & TX

I created a dataframes for CA and DC for exploration.

Univariate Analysis

What is the structure of your dataset?

It was found that data frame consists of 113937 observations of 81 variables. This is an enormous dataset. I decided to choose certain columns and work exclusively on them. My new dataframe consists of 21 variables. The variables chosen were:

Continuous Variables:

EmploymentStatusDuration, CreditScoreRangeLower, CreditScoreRangeUpper, CurrentCreditLines,AvailableBankcardCredit, DebtToIncomeRatio, StatedMonthlyIncome, BorrowerAPR BorrowerRate, LoanOriginalAmount, MonthlyLoanPayment

Discrete Variables:

BorrowerState, Occupation, EmploymentStatus, IsBorrowerHomeowner, IncomeRange, IncomeVerifiable, Term ListingCategory..numeric, LoanStatus, LoanOriginationDate

The Loan terms are 1, 3 and 5 years. Based on the histogram, the three year term is the most common.

What is/are the main feature(s) of interest in your dataset?

I am interested in finding the relationship between Income, Loan Status, Borrower Rate, DebtToIncomeRation & Monthly Loan Payment. There are definitely other variables which will further effect these.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

  • The repayment of loans depends heavily on Income, Monthly Loan Payment, interest rate, Employment status, Debt to Income ratio and Loan Status, Homeowner status. The amount of loan recieved in the first place also depends on these factors. I would like to study the corealtion among these variabes and produce visualizations to see if there exists any relationship. These are the variables I will be exploring further.

  • I want to compare and contrast the states DC and CA. A large amount of money goes to CA because there are large number of borrwers. A large amount of money is also going out to DC since the mean Loan Original Amount is highest in DC. Hence, I would like to compare these two.

  • I also want to zoom in on borrowers with their Loan Status based on whether they are Homeowners or not. I would like to know how the Borrower Rate is different for them.

Did you create any new variables from existing variables in the dataset?

The following new variables were created:

Status: The Status being ’Defaulted, PastDue, Current or Paid and Cancelled

LoanOriginationMonth: The Month the loan was taken

LoanOriginationYear: The Year the loan was taken

ListingCategory: The categories of Loans like ‘Home Improvement’, ‘Wedding Loan’ etc. This variable was numneric earlier. It was converted by using ‘factor’.

LoanStatusBucket: The loan status being - Cancelled, Open, Closed

Due: Contains all the Past Due, Defaulted and Chargedoff values

PayByFees: Log(StatedMonthlyIncome/MonthlyLoanPayment)

FeesByPay: onthlyLoanPayment/StatedMonthlyIncome

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The two numberic variables ‘Term’ and ‘Listing category’ were set to factors to make them discrete when plotted. I did not tidy or adjust the data apart from creating a few variables. I haven’t found anything unusual so far. The people with income range from $25K to $75K seem to take maximum loans. Monthly Loan payment is also highest for the same group. Maximum number of Open loans were from mid 2012 to mid 2014. Majority of the loans are debt consolidation category. The other significant ones are home improvement and business loans.

I arranged the data by descending order of Loan Amount and grouped them by Borrower State. I noticed that DC ranked highest in Mean Loan Amount. Also, CA has the highest number of loans taken. I was curious to find any relationship, if exists, between a state with highest number of loans to a state with highest mean loan amount taken. I will study this in Multivariate section.

Bivariate Plots Section:

I start the Bivariate Analysis with plotting Loan Amount against Monthly Loan Payment. A linear pattern with 3 lines is noticed.I added jitter and transperancy to the plot. I also removed the outliers. It was seen that there is a strong corelation between the two. The corelation test produced a very high positive score of 0.93. Next, I took a subset of the data where Loan Amount is less than or equal to $10k. In this case the correalation factor with Monthly Payment is a strong positve 0.88.

## 
##  Pearson's product-moment correlation
## 
## data:  pl$LoanOriginalAmount and pl$MonthlyLoanPayment
## t = 867.82, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9312165 0.9327426
## sample estimates:
##       cor 
## 0.9319837
## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and MonthlyLoanPayment
## t = 547.35, df = 82958, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8834713 0.8864227
## sample estimates:
##       cor 
## 0.8849559

Next, I plotted Credit Score (lower) and Monthly Payment. I noticed a very thick plot in the 500 to 800 range. After zooming in and adding transperancy the corelation test was run. The corealtion was weak and positive.

## 
##  Pearson's product-moment correlation
## 
## data:  pl$CreditScoreRangeLower and pl$MonthlyLoanPayment
## t = 102.99, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2871995 0.2978465
## sample estimates:
##      cor 
## 0.292532

There is a negative corelation between the Credit Score vs Borrower Rate and is weak.

## 
##  Pearson's product-moment correlation
## 
## data:  pl$CreditScoreRangeLower and pl$BorrowerRate
## t = -175.17, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4661358 -0.4569730
## sample estimates:
##        cor 
## -0.4615667

There is a negative relationship between Loan Amount and rate offered. The number and range of rates is decreasing as the Loan Amount increases. It is a weak corelation.

## 
##  Pearson's product-moment correlation
## 
## data:  pl$LoanOriginalAmount and pl$BorrowerRate
## t = -117.58, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3341283 -0.3237719
## sample estimates:
##        cor 
## -0.3289599

I used the 2 new variables FeesByPay and PayByFees in scatter plots to see how they are connected to DebtToIncoemRatio. There is initially a dip and then a rise in the line.

I have not been able to get much out of these variables by plotting it agains Debt To Income Ratio.

The debt to Income ratio increases and later decreases as the FeesByPay ratio increases. I could not understand this pattern.

It was noticed that many fell in the “Debt Consolidation” category. This is not a very clear cut category as it could mean any of the other loans. The next high values were found in Home Improvement and Business Loan.

Most of the loans are ‘Current’ followed by those ‘Completed’. This makes sense, since most loans are three year term; they peaked in 2013. So, they become past due only in 2016.(the data of which we don’t have).

The following plot is Monthly Payment filled by Loan Status. I see that very few are past due. Many are current loans.

Those that have ‘Completed’ Loan status have greater incomes than others. I facet_wrapped it with IsBorrowerHomeOwner. People with homes seem to have more ‘Completed’ status than anyone else. Also, the current loans is high in Home owners.

Also, the Debt To Income Ratio for those with ‘Completed’ status is less which is obvious.

Those with high incomes, as expected, have lower Debt to income ratio.

There is a very weak positive corelation between Stated Monthly Income and Monthly pay. This means that even when people are earning more, they prefer to keep the monthly installments low.

## 
##  Pearson's product-moment correlation
## 
## data:  pl$StatedMonthlyIncome and pl$MonthlyLoanPayment
## t = 67.764, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1912423 0.2024055
## sample estimates:
##       cor 
## 0.1968303

For incomes less than 50000 there is a moderately positive linear corealtion between Income and Loan Amount

## 
## Call:
## lm(formula = LoanOriginalAmount ~ StatedMonthlyIncome, data = subset(pl, 
##     StatedMonthlyIncome < 50000 & StatedMonthlyIncome > 10))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -31406  -4066  -1585   3244  24835 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.667e+03  3.074e+01   151.8   <2e-16 ***
## StatedMonthlyIncome 6.598e-01  4.559e-03   144.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5732 on 112137 degrees of freedom
## Multiple R-squared:  0.1574, Adjusted R-squared:  0.1574 
## F-statistic: 2.094e+04 on 1 and 112137 DF,  p-value: < 2.2e-16
## 
##  Pearson's product-moment correlation
## 
## data:  log(sal$StatedMonthlyIncome) and sal$LoanOriginalAmount
## t = 155.85, df = 112140, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4171193 0.4267411
## sample estimates:
##       cor 
## 0.4219421

There is no linear relationship between Credit Score and Income.

## 
## Call:
## lm(formula = log(NewMonthlyIncome$StatedMonthlyIncome) ~ NewMonthlyIncome$CreditScoreRangeUpper)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6808 -0.3458  0.0169  0.3685  5.7998 
## 
## Coefficients:
##                                         Estimate Std. Error t value
## (Intercept)                            6.864e+00  1.929e-02  355.86
## NewMonthlyIncome$CreditScoreRangeUpper 2.255e-03  2.726e-05   82.73
##                                        Pr(>|t|)    
## (Intercept)                              <2e-16 ***
## NewMonthlyIncome$CreditScoreRangeUpper   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6035 on 111629 degrees of freedom
##   (578 observations deleted due to missingness)
## Multiple R-squared:  0.05777,    Adjusted R-squared:  0.05776 
## F-statistic:  6845 on 1 and 111629 DF,  p-value: < 2.2e-16
## 
##  Pearson's product-moment correlation
## 
## data:  NewMonthlyIncome$CreditScoreRangeUpper and log(NewMonthlyIncome$StatedMonthlyIncome)
## t = 82.732, df = 111630, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2348253 0.2458799
## sample estimates:
##       cor 
## 0.2403604

Most of the borrowers are employed and the rate ranges betwenn 0.1 and 0.3 for the employed ones.

Median Borrower Rate is inversely proportional to Income Range.

As expected the median Loan Amount increases with the Income range.

There is no linear relation between the current credit lines and Debt to Income Ratio. I was expecting an inverse relation.

It is surprising to see that borrowers with high Debt to Income Ratio have the same borrower rates as others.

The borrower rate is lower for a home owner.

By plotting a LoanOriginationYear histogram, it was seen that maximum of the Due loans were found in 2013 and 2014. Also, most of these were 3-year term loans. These “Due” values were found to peak at Loan Amounts of $6000, $11,000 and $16000. By introducing limits and breaks, it was found that the loan amount peaked at - $1000, $2000, $4000, $5000,$10000 & $15000.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

There is a weak positive corealtion between Credit Score and MonthlyLoanPayment. There is a negative corelation between the Credit Score and Borrower Rate and is weak. It cannot be said if the Credit Score really affects the Borrower Rate. I guess the weak linear fit in the graph is due to some influencer outlier points. I was expecting a strong inverse relation between the two.

There is a non lienar relationship between Loan Amount and rate offered. The number and range of rates is decreasing as the Loan Amount increases.

I also noticed that those that have completed Loan status have higher incomes than others. This again is a very obvious thing. What is surprising is that people with homes have the highest ‘Completed’ loan status. I am assuming they don’t have home mortgages to pay!!!

The ‘Current’ loan Status also is high for Homeowners. This is probably to pay for the home.

There is a very weak positive corelation between ‘Stated Monthly Income’ and ‘Monthly Loan Payment’. This means that even when people are earning more, they prefer to keep the monthly installments low. I expected that people with higher incomes would pay off their loans by paying more amount monthly and get done.

For incomes less than $50000 there is a moderately positive linear corealtion between Income and Loan Amount.

The median Borrower rate decresses with increase in income. Median loan amount increases with income

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I found that the Borrower Rate is lower for a HomeOwner. At the same time, Borrower Rate does not seem to be related to the DebtTo Income ratio, since theose with high ratio also have the same BorrowerRate. I was expecting an inverse relation between Current Credit lines and Debt to Income ratio but the plot showed that there is absolutely no relation. Most people with ‘Completed’ loan status are homeowners. I think they have payed their mortgages fully or have been owning the house for a very long time. Those with ‘Current’ loan status are also homeowners. This is understandable as they probably took loan to pay off their mortgage or they took a loan because a heavy amount is going towards mortgage and they need money for other things. The next thing I found was that in the highest income range people took loans for “Household expenses”, cosmetic procedures and boats.

What was the strongest relationship you found?

The strongest relationship was between Loan Original Amount and Monthly loan payment. The corelation test produced a very high positive score of 0.93. Next, I took a subset of the data where Loan Amount is less than or equal to $10k. In this case the correalation factor with Monthly Payment is a positve 0.88.

Multivariate Plots Section

$10k, $15K, $ 20K and $ 25K loans in a three year term are very common.

I do not see much difference in plots of DebtToIncomeRatio vs StatedMonthlyIncome colored by Loan Status and facet-wrapped by Homeowner variable. Whether the borrower is a homeowner or not, the distribution is almost same. This made me wonder howcome the home loan is not causing any change.

Full time employed people get maximum loan amount.

The loan amounts are large as the term increases.

Borrower Rate is maximum for the 3 year term.

The credit score is high for those with one year term.

The Loan amounts are highest for green loans, baby & adoption and business. The loan amount is lowest for students.

Baby Adoption, cosmetic procedure, green loans Household expenses and student loans are the loans for which the Borrower Rate is high. So, for the students the loan amount is the lowest and the borrower rate is the highest.

Business Loans have been defaulted quite a bit.

Maximum defaulters are in 3-year term. Seems right, since maximum number of laons is also in 3 year term.

People employed Fulltime have defaulted quite a bit. This is surprising because the Borrower rate is low for full time employees. They are kind of given privelege.

Nothing conclusive can be said about the loan status of a home owner vs a non home owner wrt to Loan Amount.

Loan Amount vs Credit Score produces a normal distribution.

There is a weak positive corelation between Credit score and Loan amount

## 
##  Pearson's product-moment correlation
## 
## data:  pl$CreditScoreRangeLower and pl$LoanOriginalAmount
## t = 122.07, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3357190 0.3460095
## sample estimates:
##       cor 
## 0.3408745

The following is a comparison between the two states CA and DC

As loan amount increases, the Debt to Income ratio of DC is slightly less than CA.

It is noticed that DC loans are equally distributed in the three terms and Debt to Income ratio is almost same accross terms. In CA however, the maximum number of loans are in 3 year term and there are many levels in Debt to Income Ratio.

The median “Past Due” status is higher in DC.

Those with ‘Past Due’ in DC have taken bigger loans than those in CA. Even the defaulted ones in DC have borrowed larger amounts than CA.

In general, Home owners have taken bigger Loans than non home owners. DC leads CA in the Loan amount among both Home owners and non owners.

The employment condition of DC seems to be good. None of the borrowers is unemployed. I wonder why they are not paying their Dues then! CA has quite a number of unemployed borrowers. The Debt To Income ratio is pretty much consistent for DC. CA has many outliers.

The same information is now depicted in a histogram of DC and CA combined

This is a freqpoly of the monthly loan payment of both states DC and CA combined

This graph presents queer curves which are tough for me to decipher.

DC presents a fairly consistent pictue. Whether you own a home or not, you are pretty much the same in terms of Debt To Income Ratio. CA has many outliers but generally, the bulk of the data has less Debt to Income Ratio.

The monthly loan payment of employed people in DC is higher than those of CA

The Debt to Income Ratio is almost same accross all kinds of Employment Status in both DC and CA. This means there is no corealtion between the two.

With higher employment in DC, the past due status seems to be higher there.This is very unusual since emloyment rate is very high.

BorrowerRate is high in CA and highest for unemployed people.

I wanted to study how Borrower Rate is affected Employment Status. The borrower rate for CA is fluctuates more.

I have not been able to find a strong corealtion between Borrower Rate and any of the variables I researched on.

## 
##  Pearson's product-moment correlation
## 
## data:  pl$BorrowerRate and pl$CreditScoreRangeLower
## t = -175.17, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4661358 -0.4569730
## sample estimates:
##        cor 
## -0.4615667
## 
##  Pearson's product-moment correlation
## 
## data:  pl$BorrowerRate and pl$AvailableBankcardCredit
## t = -119.44, df = 106390, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3491486 -0.3385518
## sample estimates:
##        cor 
## -0.3438611
## 
##  Pearson's product-moment correlation
## 
## data:  pl$BorrowerRate and pl$LoanOriginalAmount
## t = -117.58, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3341283 -0.3237719
## sample estimates:
##        cor 
## -0.3289599

I used facetwrap by loan status to plot a few graphs and filtered the data on the basis of Homeowner Status. I wanted to find out how the Homeowner status affects the borrower.

The plots are similar for all the Non Home owners, meaning there seems to be no relation between Loan amount and Monthly Income across any Loan Status for Non Home Owners.

The plots are similar for all the Home owners, meaning there seems to be no relation betwwen Loan amount and Monthly Income across any Loan Status for Home Owners.

Across all Non Home owners, there is an inverse relation between DebtToIncome Ratio and Monthly Income.

On the other hand the relation is not so pronounced in the case of Home Owners. This made me think that the home owners probably have payed off their mortgage because of which their Debts are considerably low.

For both Home Owners and Non home owners, there seems to be no relation between Monthly Income and Borrower Rate.

Visualizations giving the count of Loan Status for both home owners and non home owners.

The Loan Status in CA and DC is more or less the same whether one is a home owner or not.

I did the same procedure on both states individually. In the CA plot there does not seem to be much change. It almost looks like the previous plots.

There seems to be difference in output when I did the same for DC. More homeowners have current or completed status. So, owning a home has somehow affected the Loan Status.

Linear Modeling

1. Monthly Loan Payment & LoanOriginalAmount

A very favorable r-squared value of 0.86 is obtained. The residuals vs. fitted plot shows the residuals around the zero line forming a thich horizontal band. This is a very desirable model showing a strong linear relation.

## numeric(0)
## 
## Call:
## lm(formula = MonthlyLoanPayment ~ LoanOriginalAmount, data = pl)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -751.60  -23.45   -1.04   25.94 1499.91 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.275e+01  3.452e-01    94.9   <2e-16 ***
## LoanOriginalAmount 2.875e-02  3.313e-05   867.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69.85 on 113935 degrees of freedom
## Multiple R-squared:  0.8686, Adjusted R-squared:  0.8686 
## F-statistic: 7.531e+05 on 1 and 113935 DF,  p-value: < 2.2e-16
##                          2.5 %      97.5 %
## (Intercept)        32.07750073 33.43049805
## LoanOriginalAmount  0.02868897  0.02881886

2. Borrower Rate and StatedMonthlyIncome

## 
## Call:
## lm(formula = BorrowerRate ~ StatedMonthlyIncome + DebtToIncomeRatio + 
##     CreditScoreRangeLower, data = pl)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51576 -0.04838 -0.01086  0.04310  0.49290 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            5.678e-01  2.095e-03  271.00   <2e-16 ***
## StatedMonthlyIncome   -7.859e-07  4.102e-08  -19.16   <2e-16 ***
## DebtToIncomeRatio      6.640e-03  3.614e-04   18.37   <2e-16 ***
## CreditScoreRangeLower -5.469e-04  3.079e-06 -177.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06419 on 104794 degrees of freedom
##   (9139 observations deleted due to missingness)
## Multiple R-squared:  0.2477, Adjusted R-squared:  0.2477 
## F-statistic: 1.15e+04 on 3 and 104794 DF,  p-value: < 2.2e-16
##                               2.5 %        97.5 %
## (Intercept)            5.636862e-01  5.718991e-01
## StatedMonthlyIncome   -8.663042e-07 -7.055148e-07
## DebtToIncomeRatio      5.931757e-03  7.348428e-03
## CreditScoreRangeLower -5.529080e-04 -5.408383e-04

3. BorrowerRate and EmploymentStatus

## 
## Call:
## lm(formula = BorrowerRate ~ EmploymentStatus, data = pl)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.234079 -0.058791 -0.008791  0.057809  0.311957 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    0.185543   0.001569 118.270  < 2e-16 ***
## EmploymentStatusEmployed       0.007247   0.001595   4.544 5.52e-06 ***
## EmploymentStatusFull-time      0.001463   0.001635   0.895  0.37082    
## EmploymentStatusNot available  0.005949   0.001871   3.180  0.00147 ** 
## EmploymentStatusNot employed   0.058536   0.003018  19.396  < 2e-16 ***
## EmploymentStatusOther          0.028153   0.001980  14.221  < 2e-16 ***
## EmploymentStatusPart-time     -0.001143   0.002750  -0.416  0.67770    
## EmploymentStatusRetired        0.008899   0.003073   2.896  0.00378 ** 
## EmploymentStatusSelf-employed  0.016725   0.001835   9.116  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0745 on 113928 degrees of freedom
## Multiple R-squared:  0.008622,   Adjusted R-squared:  0.008552 
## F-statistic: 123.9 on 8 and 113928 DF,  p-value: < 2.2e-16
##                                      2.5 %      97.5 %
## (Intercept)                    0.182468328 0.188618014
## EmploymentStatusEmployed       0.004121490 0.010373322
## EmploymentStatusFull-time     -0.001740872 0.004666506
## EmploymentStatusNot available  0.002282979 0.009615643
## EmploymentStatusNot employed   0.052620574 0.064450689
## EmploymentStatusOther          0.024272787 0.032033305
## EmploymentStatusPart-time     -0.006532745 0.004246955
## EmploymentStatusRetired        0.002876168 0.014921515
## EmploymentStatusSelf-employed  0.013129575 0.020321351

4. BorrowerRate and CreditScore

## 
## Call:
## lm(formula = BorrowerRate ~ CreditScoreRangeLower, data = pl)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.49873 -0.05051 -0.01165  0.04585  0.21868 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            5.487e-01  2.041e-03   268.9   <2e-16 ***
## CreditScoreRangeLower -5.190e-04  2.963e-06  -175.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0663 on 113344 degrees of freedom
##   (591 observations deleted due to missingness)
## Multiple R-squared:  0.213,  Adjusted R-squared:  0.213 
## F-statistic: 3.068e+04 on 1 and 113344 DF,  p-value: < 2.2e-16
##                              2.5 %       97.5 %
## (Intercept)            0.544730239  0.552730443
## CreditScoreRangeLower -0.000524841 -0.000513226

BorrowerRate and Loan Amount

## 
## Call:
## lm(formula = BorrowerRate ~ LoanOriginalAmount, data = pl)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.221676 -0.053914 -0.001008  0.055049  0.283705 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.256e-01  3.491e-04   646.3   <2e-16 ***
## LoanOriginalAmount -3.941e-06  3.351e-08  -117.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07065 on 113935 degrees of freedom
## Multiple R-squared:  0.1082, Adjusted R-squared:  0.1082 
## F-statistic: 1.383e+04 on 1 and 113935 DF,  p-value: < 2.2e-16
##                            2.5 %        97.5 %
## (Intercept)         2.249327e-01  2.263012e-01
## LoanOriginalAmount -4.006297e-06 -3.874924e-06

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

There was not much difference in plots of DebtToIncomeRatio vs StatedMonthlyIncome colored by Loan Status and facet-wrapped by Homeowner variable. Whether the borrower is a homeowner or not, the distribution is almost same.

Full time employed people get maximum loan amount. The loan amounts are large as the term increases.

Borrower Rate is maximum for the 3 year term. Baby Adoption, cosmetic procedure, green loans Household expenses and student loans are the loans for which the Borrower Rate is high. I have not been able to find a corealtion between Borrower Rate and any of the variables I researched on.

The credit score is high for those with one year term. There is a weak positive corelation between Credit score and Loan amount

Maximum defaulters are in 3-year term. Business Loans have been defaulted quite a bit. Fulltime employed people have defaulted more. This is surprising because the Borrower rate is low for full time employees. They are kind of given privelege.

DC leads CA in the Loan amount among both Home owners and non owners. DC presents a fairly consistent pictue. Whether you own a home or not, you are pretty much the same in terms of Debt To Income Ratio. CA has many outliers but generally bulk of the data has less Debt to Income Ratio. Employment Status is also good in DC. BorrowerRate is high in CA and highest for unemployed people.

The plots are similar for all the Non Home owners, meaning there seems to be no relation between Loan amount and Monthly Income across any Loan Status for Non Home Owners.

Were there any interesting or surprising interactions between features?

The loan amount is least for Student and the borrower rate is the highest. This is unfair! Across all Non Home owners, there is an inverse relation between DebtToIncome Ratio and Monthly Income. On the other hand, the relation is not so pronounced in the case of Home Owners. Maybe, home owners have payed off their mortgage because of which their Debts are considerably low. For both Home Owners and non home owners, there seems to be no relation between Monthly Income and Borrower Rate.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I created five models -

Linear Regression:

  • 1.Monthly Loan Payment and Loan Amount I got an r-squared value of 0.86. This was really the first positive thing I saw in this whole exploration. The residual vs Fit plot is beautiful. There are values on either side of the zero line which is good. The residuals form a horizontal band around the zero line which implies that the variances of the error terms are equal. There are no big outliers.

  • 2.BorrowerRate and CreditScore Rsuared is 0.21. The residual vs Fit plot show a good horizontal group around the zero line. A group of outliers pulls down this line. Maybe dealing with outliers will produce a better outcome. This is one thing I would love to revisit in future.

  • 3.BorrowerRate and Loan Amount The rsquared is 0.10. The residual vs Fit plot shows a good strength around the zero line. I do not understand why the realtion is weak when it looks go good on the graph. :( Maybe, I am missing something. I will revisit these models, once I learn Linear Regression and Machine Learning and correct the mistakes that I possibly made.

Multiple Linear Regression

  • 4.Borrower Rate and CreditScore, DebtToIncomeRatio and StatedMonthlyIncome I got a weak r-squared value of 0.24. Borrower Rate is totally eluding my grasp of this Loan data. The good thing is it is at least slightly corelated to Income, DebtToIncome and Credit Score.

Linear Regression with Categorical variable

  • 5.Borrower Rate and Employment Status I could not perform a cor.test on the categorical variable Employment Status. So I decided to factor it so that I will be able to build a linear model. I came to know that Borrower Rate and Emplyoment Status cannot be fit in a model. The residuals form perpendicular bands with the zero line. I got a very less r-squared value of 0.0086. The Borrower Rate does not seem to be related to Employment Status at all. The plots glaringly present a non-linear model. ——

Final Plots and Summary

Plot One: Loan Amount to Listing Category

Description One

This plot shows that ‘Default’ status is high in Boat, Auto, Business, Home Improvement and ‘Household expenses’ categories. This is surprising since most of these loans are taken by people with Employment. I had created a box plot of income range of borrowers vs the listing category. There, I wanted to find out what kind of loans the people with higher incomes are taking. It was noticed that in the highest income range people are taking loans for “Household expenses”, cosmetic procedures and boats. The above plot shows that the Household Expenses, Boat etc have defaulters. That is wierd, since the borrowers are earning but not repaying the loan.

What I felt is unfair is that Students get very less loan amount. This is offered at a very high Borrower rate.Baby Adoption, cosmetic procedure, green loans Household expenses and student loans are the loans for which the Borrower Rate is high. So, for the students loan, the loan amount is the lowest and the borrower rate is the highest.

I guess a more structured, progressive way should be introduced, where people with more basic, everyday needs are catered to. I am not against vacation loans and boat loans, but it is not reasonable to not provide money to a person willing to study. The borrower rate also has to be reasonable.

Plot Two: Income Range

Description Two

This shows that there are ‘Defaulted’ and ‘Past Due’ status in lower income and higher income ranges. I wonder why people are not paying off the loan on time if they get a good income. Maybe, there is something I am misising here.

As the income increases, the loan amount for sure is increasing, yet, at the same time, the monthly loan payment is not. People probably feel more secure paying small amounts than big chunks. This does not explain why there are defualters in High Income groups.

Plot Three: Loan Status in DC based on Homeowner Status

Description Three

I noticed that owning a home or not in US does not seem to affect the loan status at all. Then i zoomed in on the CA+DC dataframe. In this combined dataframe also, I noticed there is not much change. I was still curious and filtered the data to get individual state graphs. CA also showed me the same graph where there is not much change whether one is a home owner or not.

DC, on the other hand showed that those who own a home have more number of ‘Completed’, ‘Charged off’, & ‘Current’ status. Not only that, there are no ‘Past Due (91-120 days)’. This is good since even if they are due they are not prolonging it and repaying the loan.

Again, the employment status in DC is very good and also the income range is high. It makes sense that they are not much past due.

The homeowners in DC have a better Loan status spectrum than others. Again, the Mean Loan Amount is highest in DC. So, if they are paying back on time, this defintely says a lot about their income ranges. Homeowner status might not be a cause of this but there seems to be a corelation.

Reflection

I have no exposure to Finance sector in general and Peer to Peer Loans system in particular. When I saw the options for the project, I was in two minds whether to choose this dataset or not. I decided to do it, since this would be a wonderful learning experience for me. At the first glimpse of the data, I have to admit, I was intimidated beyond measure seeing the observations and variable numbers. Yet, since it is mere exploration and visualization, with no right and wrong solutions, I decided to tread along.

I reduced the variable number, based on what interested me. The univariate and bivariate graphs gave me a lot of perspective. I used this to furnish the multivariate plots. I have not been able to rightly establish how the variables are related to each other. Yet, the graphs have shown that variables influence each other.

California has the largest number of loans(borrowers) in the company and the mean Loan Amount is highest in DC. So, technically, a lot of money is circulating in these two states. Hence, I compared the variables of these two states and produced some multivariate plots.

I also studied how Homeowner Status affects the Loan Status in US in general, in CA and DC together and again individually in CA and DC.

I performed around 8 corelation tests. I decided to make 5 models using linear modelling. The Monthly Loan Payment and Loan Amount gave me a strong r-squared value of 0.86. The other models have something missing in them. I do not have thorough knowledge in this topic and hence have probably misinterpreted the relationship where none exist at all!?

I summarized my whole exploration by presenting three plots of interest to me. This has been a good learning experience for me. I took up this task with zero knowledge of finance sector and now at least have an amatuer knowledge of how things in the P2P Loan sector. Apart from that, I have been able to put to test what I learnt about R from Udacity. I intend to study this more as and when I get time. The next time around I would like to consider other variables too. I definitely want to explore more on the Linear Modeling section and make necessary corrections based on what I learn from Udacity in the coming sections.